Final tokenizer's cleanup #7291

Merged: 1 commit into dotnet:main on Nov 8, 2024

tarekgh (Member) commented Nov 8, 2024

No description provided.

codecov bot commented Nov 8, 2024

Codecov Report

Attention: Patch coverage is 79.38312% with 127 lines in your changes missing coverage. Please review.

Project coverage is 68.84%. Comparing base (5c50319) to head (31b97b8).
Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/Microsoft.ML.Tokenizers/Model/BertTokenizer.cs | 57.30% | 48 Missing and 28 partials ⚠️ |
| .../Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs | 67.24% | 10 Missing and 9 partials ⚠️ |
| ...icrosoft.ML.Tokenizers/Model/WordPieceTokenizer.cs | 60.00% | 7 Missing and 5 partials ⚠️ |
| ...soft.ML.Tokenizers/Model/SentencePieceTokenizer.cs | 85.71% | 4 Missing and 1 partial ⚠️ |
| src/Microsoft.ML.Tokenizers/Model/Phi2Tokenizer.cs | 0.00% | 4 Missing ⚠️ |
| src/Microsoft.ML.Tokenizers/Tokenizer.cs | 75.00% | 4 Missing ⚠️ |
| src/Microsoft.ML.Tokenizers/Model/BPETokenizer.cs | 89.65% | 2 Missing and 1 partial ⚠️ |
| ...oft.ML.Tokenizers/Model/EnglishRobertaTokenizer.cs | 90.90% | 1 Missing ⚠️ |
| ...crosoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs | 83.33% | 0 Missing and 1 partial ⚠️ |
| test/Microsoft.ML.Tokenizers.Tests/CodeGenTests.cs | 98.93% | 0 Missing and 1 partial ⚠️ |
| ... and 1 more | | |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7291      +/-   ##
==========================================
- Coverage   68.87%   68.84%   -0.04%     
==========================================
  Files        1467     1473       +6     
  Lines      273954   274159     +205     
  Branches    28380    28420      +40     
==========================================
+ Hits       188685   188737      +52     
- Misses      77961    78112     +151     
- Partials     7308     7310       +2     
| Flag | Coverage Δ |
| --- | --- |
| Debug | 68.84% <79.38%> (-0.04%) ⬇️ |
| production | 63.29% <69.21%> (-0.04%) ⬇️ |
| test | 89.18% <99.04%> (+<0.01%) ⬆️ |
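
The project and patch percentages in this report can be cross-checked from the raw counts. The following is a minimal sketch, not Codecov's own calculation: it assumes coverage is simply hits divided by lines, that each line is exactly one of hit, miss, or partial, and that the "127 lines" figure counts both missing and partial lines. The patch size is inferred from the reported percentage, since the page does not state it.

```python
# Counts copied from the "Coverage Diff" table in the Codecov report.
base = {"lines": 273954, "hits": 188685, "misses": 77961, "partials": 7308}
head = {"lines": 274159, "hits": 188737, "misses": 78112, "partials": 7310}


def coverage_pct(report):
    """Return hits / lines as a percentage (assumed definition)."""
    return 100.0 * report["hits"] / report["lines"]


for name, report in (("base", base), ("head", head)):
    # Assumption: every tracked line is exactly one of hit, miss, or partial.
    assert report["hits"] + report["misses"] + report["partials"] == report["lines"]
    print(f"{name}: {coverage_pct(report):.2f}%")

# Patch coverage: the report gives a percentage and an uncovered-line count,
# so the patch size is inferred here rather than read from the page.
patch_pct = 79.38312
patch_uncovered = 127
patch_lines = round(patch_uncovered / (1 - patch_pct / 100))
patch_covered = patch_lines - patch_uncovered
print(f"patch: {patch_covered}/{patch_lines} = {100 * patch_covered / patch_lines:.5f}%")

# Uncovered lines (missing + partials) for the ten per-file rows that are shown.
listed_uncovered = [48 + 28, 10 + 9, 7 + 5, 4 + 1, 4, 4, 2 + 1, 1, 0 + 1, 0 + 1]
# Whatever is left over would belong to the row elided as "... and 1 more".
unlisted_uncovered = patch_uncovered - sum(listed_uncovered)
print(f"listed: {sum(listed_uncovered)}, unlisted: {unlisted_uncovered}")
```

Under these assumptions the script reproduces the 68.87% (base) and 68.84% (head) project figures and implies a patch of 616 lines with 489 covered. The ten listed rows account for 126 of the 127 uncovered lines, which leaves 1 for the file that is not shown.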

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/Microsoft.ML.Tokenizers/Model/BertOptions.cs | 100.00% <100.00%> (ø) |
| ...rc/Microsoft.ML.Tokenizers/Model/LlamaTokenizer.cs | 59.09% <ø> (ø) |
| ...Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs | 78.28% <100.00%> (ø) |
| .../Microsoft.ML.Tokenizers/Model/WordPieceOptions.cs | 100.00% <100.00%> (ø) |
| ...crosoft.ML.Tokenizers/Normalizer/BertNormalizer.cs | 62.85% <100.00%> (ø) |
| ...ft.ML.Tokenizers/PreTokenizer/RegexPreTokenizer.cs | 87.23% <100.00%> (ø) |
| src/Microsoft.ML.TorchSharp/NasBert/NerTrainer.cs | 91.10% <100.00%> (ø) |
| ...icrosoft.ML.Tokenizers.Tests/BertTokenizerTests.cs | 100.00% <100.00%> (ø) |
| test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs | 100.00% <100.00%> (ø) |
| ...crosoft.ML.Tokenizers.Tests/EnglishRobertaTests.cs | 100.00% <100.00%> (ø) |
| ... and 16 more | |

... and 14 files with indirect coverage changes

tarekgh merged commit 8611211 into dotnet:main on Nov 8, 2024
25 checks passed

tarekgh (Member, Author) commented Nov 8, 2024

/backport to release/4.0

github-actions bot commented Nov 8, 2024

Started backporting to release/4.0: https://github.com/dotnet/machinelearning/actions/runs/11747721749

github-actions bot commented Nov 8, 2024

@tarekgh backporting to release/4.0 failed; the patch most likely resulted in conflicts:

$ git am --3way --empty=keep --ignore-whitespace --keep-non-patch changes.patch

Applying: Final tokenizer's cleanup
Using index info to reconstruct a base tree...
M	src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs
M	src/Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs
M	src/Microsoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs
M	src/Microsoft.ML.Tokenizers/PreTokenizer/RegexPreTokenizer.cs
M	test/Microsoft.ML.Tokenizers.Tests/TiktokenTests.cs
Falling back to patching base and 3-way merge...
Auto-merging test/Microsoft.ML.Tokenizers.Tests/TiktokenTests.cs
Auto-merging src/Microsoft.ML.Tokenizers/PreTokenizer/RegexPreTokenizer.cs
CONFLICT (content): Merge conflict in src/Microsoft.ML.Tokenizers/PreTokenizer/RegexPreTokenizer.cs
Auto-merging src/Microsoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs
Auto-merging src/Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs
Auto-merging src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs
CONFLICT (content): Merge conflict in src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 Final tokenizer's cleanup
Error: The process '/usr/bin/git' failed with exit code 128

Please backport manually!

github-actions bot commented Nov 8, 2024

@tarekgh an error occurred while backporting to release/4.0. Please check the run log for details!

Error: git am failed, most likely due to a merge conflict.

tarekgh added a commit to tarekgh/machinelearning that referenced this pull request Nov 8, 2024