Skip entire header for llama3 decode #1656
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1656
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure. As of commit e7bbac6 with merge base 57ab583: NEW FAILURE - the following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
some comments and questions
```python
elif token_ids[idx] == self.end_header_id:
    # Mask out end header id and "\n\n" after it, then reset mask
    not_header[idx] = False
    mask = True
```
So after we see "end_header_id", mask is set to True, and will never be False again, is that right? If so, can we just exit the while loop? Or do we need to keep checking because of sample packing, for example?
```python
not_header = [True] * len(token_ids)
mask = True
idx = 0
while idx < len(token_ids):
```
This feels like it would be more efficient with numpy arrays. What sucks is going list -> array -> list.
But you could do something like:

```python
indices_start = get_index_start_header(token_ids, start_header_id)
indices_end = get_index_start_header(token_ids, end_header_id)
for idx_start, idx_end in zip(indices_start, indices_end):
    mask[idx_start : idx_end + 1] = False
```

That way the Python for-loop is over the number of header ids, not over all token indices.
ps: not saying it needs to be done. Just thinking out loud. I usually try to avoid Python loops.
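A runnable version of that idea might look like the sketch below; the header token ids are assumed values, and `np.where` stands in for the hypothetical `get_index_start_header` helper:

```python
import numpy as np

# Assumed llama3 header ids for illustration; take the real ones from the tokenizer.
START_HEADER_ID = 128006
END_HEADER_ID = 128007

token_ids = np.array(
    [128006, 882, 128007, 271, 9906, 128006, 78191, 128007, 271, 1234]
)

# Keep-mask over all tokens; header spans get flipped to False below.
keep = np.ones(len(token_ids), dtype=bool)

indices_start = np.where(token_ids == START_HEADER_ID)[0]
indices_end = np.where(token_ids == END_HEADER_ID)[0]

# The Python loop now runs once per header pair instead of once per token.
for idx_start, idx_end in zip(indices_start, indices_end):
    keep[idx_start : idx_end + 1] = False

filtered = token_ids[keep].tolist()  # [271, 9906, 271, 1234]
```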
```python
    truncate_at_eos=truncate_at_eos,
    skip_special_tokens=skip_special_tokens,
)
if skip_special_tokens:
```
Will this test pass?

```python
text = decode(encode(text), skip_special_tokens=True)
```

Probably not well formulated, but I want to see whether after every encode/decode cycle we are adding "\n\n".
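A more concrete version of that check might look like the sketch below; the `tokenizer` fixture and the encode/decode keyword arguments are assumptions, not torchtune's exact test API:

```python
# Sketch of the roundtrip check asked about above. `tokenizer` is assumed to be
# a llama3 tokenizer fixture; the encode/decode signatures are assumptions.
def test_decode_roundtrip(tokenizer):
    text = "Hello world"
    token_ids = tokenizer.encode(text, add_bos=False, add_eos=False)
    decoded = tokenizer.decode(token_ids, skip_special_tokens=True)
    # The concern: does each encode/decode cycle introduce an extra "\n\n"?
    assert decoded == text
```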
```python
    idx += 1

return self.tt_model.decode(
    [token_ids[i] for i, m in enumerate(not_header) if m],
```
I suggested defining token_ids inside the if condition. Then you can have a single return and remove the if/else and the duplicated code.
E.g., instead of

```python
if something:
    return self.tt_model.decode(token_ids_a)
else:
    return self.tt_model.decode(token_ids_b)
```

do:

```python
if something:
    token_ids = token_ids_a
return self.tt_model.decode(token_ids)
```
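As a tiny self-contained illustration of the same shape (the `decode_fn` callable here is just a stand-in, not torchtune's API):

```python
# Toy stand-in for the pattern above: the token list is reassigned under the
# condition, and there is exactly one call site for the decoder.
def decode_skipping_ids(token_ids, skip_ids, decode_fn, skip=True):
    if skip:
        token_ids = [t for t in token_ids if t not in skip_ids]
    return decode_fn(token_ids)


print(decode_skipping_ids([1, 5, 2, 7], {1, 2}, lambda ids: " ".join(map(str, ids))))  # "5 7"
```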
```python
    truncate_at_eos=truncate_at_eos,
    skip_special_tokens=skip_special_tokens,
)
if skip_special_tokens:
```
I think a comment here is necessary to explain why you need this extra logic for "skip_special_tokens" if decode already has this flag. In other words: why do we skip special tokens in two different places? Is there a more elegant way to solve this, like adding the special token to the tt_model directly?
@KaiserLeave has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
```python
    skip_special_tokens=False,
)
if skip_special_tokens:
```
So I understand that the goal here is just to fix skip_special_tokens for the Llama3 tokenizer decode, but it seems to me like we are doing something very unexpected with skip_special_tokens: (a) we have it defined on the base tokenizer and it's basically a no-op now, and (b) we are now inconsistent on whether this needs to be defined on the ModelTokenizer or the BaseTokenizer. If it is a function of tokenize_messages on the ModelTokenizer more so than the BaseTokenizer, maybe we should update the Protocol along with the other callsites?
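For concreteness, a hypothetical sketch of a Protocol that carries the flag explicitly; whether it should live on the base or the model tokenizer is exactly the open question here, and none of this is torchtune's actual definition:

```python
# Hypothetical sketch: a tokenizer Protocol whose decode explicitly takes
# skip_special_tokens, so every implementation must honor it. Which class
# (base vs. model tokenizer) should own the flag is the question raised above.
from typing import Any, List, Protocol


class TokenizerWithSpecialTokenControl(Protocol):
    def decode(
        self,
        token_ids: List[int],
        skip_special_tokens: bool = True,
        **kwargs: Any,
    ) -> str:
        ...
```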
```python
# We will remove special tokens manually via regex on the decoded string.
# This is because removing all special tokens does not remove the role and
# whitespace added from the special tokens, i.e., the "user" and "\n\n" in
# "<|start_header_id|>user<|end_header_id|>\n\n"
```
I would maybe move this comment up to where you define self._special_token_regex and self._special_token_header_regex.
This is where it actually happens, so it makes more sense to keep it here? No strong opinions.
Yeah I feel the same, fine to keep it here then
Looks like the PR is out of sync with the internal diff? Otherwise it looks good to me, stamping so you're unblocked
@KaiserLeave has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Context
What is the purpose of this PR?
With skip_special_tokens=True, the llama3 tokenizer still included the role name because it is part of the header but is not a special token. This PR filters out all tokens between start_header_id and end_header_id. Bonus: also expose skip_special_tokens in Phi3MiniTokenizer for consistency.

Before:

After:
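As a rough illustration of the filtering described above (the token ids and the helper below are assumptions for illustration, not the PR's actual code):

```python
# Illustrative sketch only: assumed llama3 special token ids and a hypothetical
# helper showing the "drop everything between the header ids" behavior.
from typing import List

START_HEADER_ID = 128006  # assumed id for <|start_header_id|>
END_HEADER_ID = 128007    # assumed id for <|end_header_id|>


def filter_header_tokens(token_ids: List[int]) -> List[int]:
    """Drop header ids, the role tokens between them, and the "\\n\\n" that follows."""
    kept: List[int] = []
    in_header = False
    skip_next = False  # used to drop the "\n\n" token right after an end header
    for tok in token_ids:
        if skip_next:
            skip_next = False
            continue
        if tok == START_HEADER_ID:
            in_header = True
        elif tok == END_HEADER_ID:
            in_header = False
            skip_next = True
        elif not in_header:
            kept.append(tok)
    return kept


# e.g. [start_header, "user", end_header, "\n\n", "Hello!"] -> ["Hello!"]
print(filter_header_tokens([128006, 882, 128007, 271, 9906]))  # [9906]
```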
Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.
- run pre-commit hooks and linters (make sure you've first installed them via pre-commit install)
- run unit tests via pytest tests
- run recipe tests via pytest tests -m integration_test
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example and a tutorial example.