Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: added null support to spm_precompiled #3

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

vicantwin
Copy link

adds null support to spm_precompiled based on huggingface/tokenizers#1627
also made some changes to test.rs to move the big binary and json into separate files.

This pull request includes changes to enhance the handling of optional precompiled character maps in the Precompiled struct. The most important changes involve modifying the serialization and deserialization methods, updating the struct definitions, and adjusting the implementation of the from method.

Handling Optional Precompiled Character Maps:

  • src/lib.rs: Changed precompiled_charsmap from Vec<u8> to Option<Vec<u8>> in both Precompiled and PrecompiledDeserializer structs. [1] [2]
  • src/lib.rs: Updated as_base64 and from_base64 functions to handle Option<T> types, ensuring proper serialization and deserialization of optional values.
  • src/lib.rs: Modified the from method in the Precompiled implementation to handle Option<&[u8]>, providing a default value when None is encountered.

@ArthurZucker
Copy link

Before I can review, could you fix the diff issue? test.rs seems to have a LOT of removed code that I can't see!

@vicantwin
Copy link
Author

vicantwin commented Oct 10, 2024

The diff issue is caused by removing about 15,000 lines of u8 values that were previously part of test.rs.
The original file had around 232 KB of binary data, so I moved it into nmt_nfkc.bin and simply load it in the test now.

Unfortunately, there's no way to 'fix' the diff without re-adding those 15,000 lines.
But let me know if you want me to revert it back.

@vicantwin
Copy link
Author

I've made the file-moving 5bb55be and null support a9abd29 into two different commits. Try checking the diff for a specific commit.

@vicantwin
Copy link
Author

@ArthurZucker any updates?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants