[tokenizers] support stride in tokenizers #2006

siddvenk · 2022-09-09T20:45:28Z

Description

This PR adds support for the stride parameter in truncation. Users can now pass a stride value and get the overflowing encodings back. See #1996.
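For readers unfamiliar with stride, here is a minimal sketch of how stride-based truncation can split a long input into a main encoding plus overflowing encodings. It is plain Java for illustration only, not the code in this PR or the underlying tokenizer library: the class `StrideSketch` and method `windows` are hypothetical names, and the sketch assumes that stride means the number of tokens shared by consecutive windows and that special tokens are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of DJL.
class StrideSketch {

    /**
     * Splits token ids into windows of at most maxLength tokens.
     * Assumption: consecutive windows overlap by exactly `stride` tokens.
     * The first window plays the role of the truncated encoding, the
     * remaining windows play the role of the overflowing encodings.
     */
    static List<int[]> windows(int[] tokens, int maxLength, int stride) {
        if (maxLength <= 0 || stride < 0 || stride >= maxLength) {
            throw new IllegalArgumentException("require 0 <= stride < maxLength");
        }
        int step = maxLength - stride; // how far each new window advances
        List<int[]> result = new ArrayList<>();
        for (int start = 0; start < tokens.length; start += step) {
            int end = Math.min(start + maxLength, tokens.length);
            result.add(Arrays.copyOfRange(tokens, start, end));
            if (end == tokens.length) {
                break; // the last window reached the end of the input
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] tokens = {1, 2, 3, 4, 5, 6, 7};
        // maxLength = 4 and stride = 2, so each window advances by 2 tokens.
        for (int[] window : windows(tokens, 4, 2)) {
            System.out.println(Arrays.toString(window));
        }
    }
}
```

With seven tokens, a maximum length of 4 and a stride of 2, the sketch yields three windows: the first would correspond to the truncated encoding and the other two to its overflows.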

One thought I had is whether we should add a flag to the tokenizer that indicates whether a user wants overflows returned. I'm not sure how often these overflows are used in practice, but I imagine there are cases where tokenization overflows and the user doesn't need the overflowing encodings. A flag would let us skip the extra JNI calls in those cases.

support stride in tokenizers

c07a38e

siddvenk requested review from zachgk and frankfliu as code owners September 9, 2022 20:45

frankfliu approved these changes Sep 9, 2022

siddvenk merged commit c974939 into deepjavalibrary:master Sep 9, 2022

siddvenk deleted the tokenizer-stride branch September 9, 2022 22:29
