
feat: Async tokenizer #86

Merged: 28 commits into main from async_tokenizer, Jun 18, 2024
Conversation

miri-bar (Contributor)

No description provided.

@miri-bar miri-bar requested a review from a team as a code owner June 16, 2024 13:17
README.md: conversation resolved
        self._cache_dir = cache_dir or _DEFAULT_MODEL_CACHE_DIR

    async def __aenter__(self):
        await self._init_tokenizer()
Contributor:

Why are you not returning the tokenizer object from this function?

self._tokenizer = await self._init_tokenizer()

It's not good practice to mutate a variable inside a function and rely on it having been changed outside of the function.

@Josephasafg (Contributor), Jun 17, 2024:
This means you need to declare the _tokenizer in the class's properties, like so:

class AsyncJambaInstructTokenizer(BaseJambaInstructTokenizer, BaseTokenizer):
    _tokenizer: Optional[Tokenizer] = None
    ...

miri-bar (Contributor, Author):

I declared it in BaseJambaInstructTokenizer, since it's a variable that both the sync and async classes share.
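
Putting the thread together, the resulting shape is roughly this (a sketch only, not the actual source; the tokenizers import and the _init_tokenizer body are assumptions, and the BaseTokenizer parent is omitted for brevity):

from typing import Optional

from tokenizers import Tokenizer  # assumed import, matching the annotation above


class BaseJambaInstructTokenizer:
    # Declared once in the base class, since both the sync and async
    # classes have this member.
    _tokenizer: Optional[Tokenizer] = None


class AsyncJambaInstructTokenizer(BaseJambaInstructTokenizer):
    async def _init_tokenizer(self) -> Tokenizer:
        # Placeholder for the actual async loading logic.
        raise NotImplementedError

    async def __aenter__(self) -> "AsyncJambaInstructTokenizer":
        # Assign the returned value rather than mutating state inside
        # _init_tokenizer, per the suggestion above.
        self._tokenizer = await self._init_tokenizer()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb) -> None:
        return None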

Comment on lines 34 to 38
self._id_to_token_map = {i: self._sp.id_to_piece(i) for i in range(self.vocab_size)}
self._token_to_id_map = {self._sp.id_to_piece(i): i for i in range(self.vocab_size)}
self._no_show_tokens = set(
    self._convert_ids_to_tokens([i for i in range(self.vocab_size) if self._sp.IsControl(i)])
)
Contributor:

All of these lines are already happening in the base class, no?

miri-bar (Contributor, Author):

Nope.
The base class holds the initializations that don't depend on load_binary, so both the sync and async classes can use them.
These lines initialize the members that do depend on load_binary (the "heavy" operation, which runs asynchronously in the async class).
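
As a rough sketch of that split (illustration only; _load_binary, _init_post_load, and _aload are assumed names, not the actual source):

import asyncio
from typing import Optional

import sentencepiece as spm


class BaseJurassicTokenizer:
    def __init__(self):
        self._sp: Optional[spm.SentencePieceProcessor] = None
        # ... initializations that don't depend on load_binary go here ...

    @property
    def vocab_size(self) -> int:
        return self._sp.GetPieceSize()

    def _load_binary(self, model_bytes: bytes) -> None:
        # The "heavy" operation: build the SentencePiece processor.
        self._sp = spm.SentencePieceProcessor()
        self._sp.LoadFromSerializedProto(model_bytes)

    def _init_post_load(self) -> None:
        # Members that depend on load_binary, as in the snippet above.
        self._id_to_token_map = {i: self._sp.id_to_piece(i) for i in range(self.vocab_size)}
        self._token_to_id_map = {v: k for k, v in self._id_to_token_map.items()}


class AsyncJurassicTokenizer(BaseJurassicTokenizer):
    async def _aload(self, model_bytes: bytes) -> None:
        # Run the heavy load off the event loop, then build the dependent members.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, self._load_binary, model_bytes)
        self._init_post_load()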



class BaseJurassicTokenizer(ABC):
_sp: spm.SentencePieceProcessor = None
Contributor:

The type is wrong. If something can be None, then its type should be Optional[<object_type>].
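
For example, a minimal corrected version of the snippet above:

from abc import ABC
from typing import Optional

import sentencepiece as spm


class BaseJurassicTokenizer(ABC):
    _sp: Optional[spm.SentencePieceProcessor] = None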

@Josephasafg (Contributor) left a comment:

LGTM

@Josephasafg added the labels good first issue (Good for newcomers) and lgtm (Looks Good to Me), then removed the label good first issue, on Jun 18, 2024
@miri-bar miri-bar merged commit 3006cda into main Jun 18, 2024
6 checks passed
@miri-bar miri-bar deleted the async_tokenizer branch June 18, 2024 12:45
Labels: lgtm (Looks Good to Me)
3 participants