
feat: Async tokenizer #86

Merged: 28 commits into main from async_tokenizer, Jun 18, 2024
Conversation

miri-bar (Contributor)

No description provided.

@miri-bar miri-bar requested a review from a team as a code owner June 16, 2024 13:17
README.md: conversation resolved
        self._cache_dir = cache_dir or _DEFAULT_MODEL_CACHE_DIR

    async def __aenter__(self):
        await self._init_tokenizer()
Contributor:

Why are you not returning the tokenizer object from this function?

self._tokenizer = await self._init_tokenizer()

It's not good practice to mutate a variable inside a function and rely on it having been changed outside of the function.

@Josephasafg (Contributor), Jun 17, 2024:
This means you need to declare the _tokenizer in the class's properties, like so:

class AsyncJambaInstructTokenizer(BaseJambaInstructTokenizer, BaseTokenizer):
    _tokenizer: Optional[Tokenizer] = None
    ...

miri-bar (Contributor, Author):

I declared it in BaseJambaInstructTokenizer, since it's a variable that both the sync and async classes share.
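
Putting the thread together, the resulting shape is roughly this (a sketch only, not the actual source; the tokenizers import and the _init_tokenizer body are assumptions, and the BaseTokenizer parent is omitted for brevity):

from typing import Optional

from tokenizers import Tokenizer  # assumed import, matching the annotation above


class BaseJambaInstructTokenizer:
    # Declared once in the base class, since both the sync and async
    # classes have this member.
    _tokenizer: Optional[Tokenizer] = None


class AsyncJambaInstructTokenizer(BaseJambaInstructTokenizer):
    async def _init_tokenizer(self) -> Tokenizer:
        # Placeholder for the actual async loading logic.
        raise NotImplementedError

    async def __aenter__(self) -> "AsyncJambaInstructTokenizer":
        # Assign the returned value rather than mutating state inside
        # _init_tokenizer, per the suggestion above.
        self._tokenizer = await self._init_tokenizer()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb) -> None:
        return None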

Comment on lines 34 to 38
self._id_to_token_map = {i: self._sp.id_to_piece(i) for i in range(self.vocab_size)}
self._token_to_id_map = {self._sp.id_to_piece(i): i for i in range(self.vocab_size)}
self._no_show_tokens = set(
    self._convert_ids_to_tokens([i for i in range(self.vocab_size) if self._sp.IsControl(i)])
)
Contributor:

All of these lines are already happening in the base class, no?

miri-bar (Contributor, Author):

Nope.
The base class holds the initializations that don't depend on load_binary, so both the sync and async classes can use them.
These lines initialize the members that do depend on load_binary (the "heavy" operation, which runs asynchronously in the async class).
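
As a rough sketch of that split (illustration only; _load_binary, _init_post_load, and _aload are assumed names, not the actual source):

import asyncio
from typing import Optional

import sentencepiece as spm


class BaseJurassicTokenizer:
    def __init__(self):
        self._sp: Optional[spm.SentencePieceProcessor] = None
        # ... initializations that don't depend on load_binary go here ...

    @property
    def vocab_size(self) -> int:
        return self._sp.GetPieceSize()

    def _load_binary(self, model_bytes: bytes) -> None:
        # The "heavy" operation: build the SentencePiece processor.
        self._sp = spm.SentencePieceProcessor()
        self._sp.LoadFromSerializedProto(model_bytes)

    def _init_post_load(self) -> None:
        # Members that depend on load_binary, as in the snippet above.
        self._id_to_token_map = {i: self._sp.id_to_piece(i) for i in range(self.vocab_size)}
        self._token_to_id_map = {v: k for k, v in self._id_to_token_map.items()}


class AsyncJurassicTokenizer(BaseJurassicTokenizer):
    async def _aload(self, model_bytes: bytes) -> None:
        # Run the heavy load off the event loop, then build the dependent members.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, self._load_binary, model_bytes)
        self._init_post_load()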



class BaseJurassicTokenizer(ABC):
_sp: spm.SentencePieceProcessor = None
Contributor:

The type is wrong. If something can be None, then its type should be Optional[<object_type>].
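
For example, a minimal corrected version of the snippet above:

from abc import ABC
from typing import Optional

import sentencepiece as spm


class BaseJurassicTokenizer(ABC):
    _sp: Optional[spm.SentencePieceProcessor] = None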

@Josephasafg (Contributor) left a comment:

LGTM

@Josephasafg added the labels good first issue (Good for newcomers) and lgtm (Looks Good to Me), then removed the label good first issue, on Jun 18, 2024
@miri-bar miri-bar merged commit 3006cda into main Jun 18, 2024
6 checks passed
@miri-bar miri-bar deleted the async_tokenizer branch June 18, 2024 12:45
Labels: lgtm (Looks Good to Me)
3 participants