Releases: google/budoux
v0.6.3
What's Changed
- Show failing sentences for quality test by @tushuhei in #453
- Japanese model improvement by @tushuhei in #454
- Bundle budoux-th Web Components by @tushuhei in #452
- Support new CSS display syntax by @kojiishi in #483
- Remove Node.js 16 support and add Node.js 22 support by @tushuhei in #484
- Replace WBR with ZWSP in demo page by @tushuhei in #494
- Sort ICU format output by key by @tushuhei in #495
- Update JS README about Web Workers by @tushuhei in #509
- [security] Include DOMPurify in the demo bundle by @tushuhei in #658
- [nodejs] Override with [email protected] to suppress punycode deprecation warning by @tushuhei in #657
- [demo] Bind the input content and the query param by @tushuhei in #656
- Migrate to eslint flat config using @eslint/migrate-config by @tushuhei in #673
- Mention Korean support in README.md by @tushuhei in #701
- Correct a small typo/missing word in README.md by @adamsilverstein in #746
- [Java] Handle comment nodes by @tushuhei in #764
- [Java] Skip node at the end of input by @tushuhei in #765
New Contributors
- @adamsilverstein made their first contribution in #746
Full Changelog: v0.6.2...v0.6.3
v0.6.2
Thai is now supported! 🎉
What's Changed
- Add the scale argument to encode_data.py by @tushuhei in #408
- Nit fix for an ignored test by @tushuhei in #407
- Ja model improvement by @tushuhei in #410
- Add granularity option to prepare_knbc.py by @tushuhei in #417
- Add Thai language support by @tushuhei in #421
- Improve typing by @amitmarkel in #426
- Update README for Thai support by @tushuhei in #429
- Rename @returns to @return by @tushuhei in #415
New Contributors
- @amitmarkel made their first contribution in #426
Full Changelog: v0.6.1...v0.6.2
v0.6.1
What's Changed
- Bump @typescript-eslint/eslint-plugin from 6.9.1 to 6.10.0 in /javascript by @dependabot in #353
- Bump org.apache.maven.plugins:maven-surefire-plugin from 3.2.1 to 3.2.2 in /java by @dependabot in #354
- Bump actions/dependency-review-action from 3.1.1 to 3.1.2 by @dependabot in #357
- Bump @types/node from 20.8.3 to 20.9.0 in /javascript by @dependabot in #356
- Support weighted samples by @tushuhei in #358
- Fix unpaired close tags and self-closing tags by @kojiishi in #360
- [Java] Stop emitting close tags if self-closing by @kojiishi in #362
- Update Google Java Format action by @tushuhei in #363
- Bump actions/dependency-review-action from 3.1.2 to 3.1.3 by @dependabot in #364
- Bump @typescript-eslint/eslint-plugin from 6.10.0 to 6.11.0 in /javascript by @dependabot in #365
- [java] Fix errors by collapsed white spaces and <br> by @kojiishi in #367
- Bump github/codeql-action from 2.22.5 to 2.22.6 by @dependabot in #368
- [java] Replace wholeText() with NodeVisitor by @kojiishi in #369
- Implement tail for node visitor by @tushuhei in #370
- Update jsoup to 1.16.2 by @tushuhei in #371
- Version up to 0.6.1 by @tushuhei in #372
Full Changelog: v0.6.0...v0.6.1
v0.6.0
Noteworthy changes
- BudouX Web Components don't use Shadow DOM anymore. The segmentation results will be reflected in their Light DOM, where the global styles can apply. #291
- Phrases are segmented by ZWSP (U+200B), not <wbr>, for a better screen reader experience. #346
- You can insert non-breaking markup (<nobr> and white-space: nowrap) when you have a phrase you don't want to break. #240
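As a rough illustration of the ZWSP change, the sketch below joins already-segmented phrases with U+200B. It is not BudouX's own code: the helper name join_with_zwsp and the sample phrases are made up for this example, and the segmentation itself is assumed to have happened elsewhere.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE: an invisible line-break opportunity


def join_with_zwsp(phrases):
    """Join pre-segmented phrases so a renderer may wrap between them.

    Hypothetical helper for illustration only; `phrases` is assumed to be
    the output of some phrase segmenter.
    """
    return ZWSP.join(phrases)


# Hypothetical segmentation result
phrases = ["今日は", "天気です。"]
text = join_with_zwsp(phrases)

# Removing the separators recovers the original, unsegmented string.
assert text.replace(ZWSP, "") == "".join(phrases)
```

Because ZWSP is a plain character rather than an element, the separators travel with the text when it is copied or read out, which is consistent with the screen reader motivation stated above.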
What's Changed
- Remove dependency to gts by @tushuhei in #187
- Add Parser.parseBoundaries for JavaScript by @kojiishi in #234
- Replace slice with substring by @kojiishi in #241
- Support non-breaking content (<nobr> and white-space: nowrap) by @kojiishi in #240
- Make scripts run without install by @tushuhei in #239
- Add permissions to style check action by @tushuhei in #246
- Specify maxsplit to handle colon symbols properly by @tushuhei in #247
- Support non-breaking content in java by @kojiishi in #248
- Support non-breaking content in Python by @kojiishi in #251
- Nit: use get_nowait instead of get by @tushuhei in #253
- Remove utils from JavaScript module by @tushuhei in #262
- Move hasChildTextNode to HTML Processor by @tushuhei in #274
- Fix mypy issues by @tushuhei in #308
- Fix Python dependency issues by @tushuhei in #316
- Avoid inserting separators to where the source has one by @kojiishi in #342
- [Web Components] Use Light DOM instead of Shadow DOM by @tushuhei in #291
- Use ZWSP instead of WBR by @tushuhei in #346
- [Java] Use ArrayDeque instead of Stack by @tushuhei in #349
- Rename applyElement to applyToElement by @tushuhei in #348
- Update README to use ZWSP by @tushuhei in #347
- Version up to 0.6.0 by @tushuhei in #343
Full Changelog: v0.5.2...v0.6.0
v0.5.2
What's Changed
- Use overflow-wrap: anywhere; instead of overflow-wrap: break-word; by @tamanyan in #144
- Add a script to finetune models. by @tushuhei in #145
- Add quality regression test by @tushuhei in #146
- Release finetuned model by @tushuhei in #147 #154 #161
- Add validation data arg to train.py by @tushuhei in #148
- Remove direct dependency to NumPy by @tushuhei in #149
- Add a README for BudouX Scripts by @tushuhei in #155
- Add score scale arg to build_model.py by @tushuhei in #156
- Separate HTML processing as a mixin by @tushuhei in #159
New Contributors
- @step-security-bot made their first contribution in #163
Full Changelog: v0.5.1...v0.5.2
v0.5.1
What's Changed
- Add Java module by @tushuhei in #124
- Separate HTML processing as html_processor.py by @tushuhei in #126
- Fix bug with nodes to skip by @tushuhei in #127
- Rename test_utils.py to utils.py by @tushuhei in #129
- Remove test utils by @tushuhei in #130
- Universal unit testing by @tushuhei in #125
- Replace textarea with another skip node by @tushuhei in #131
- Java style fix by @tushuhei in #132
- Java code improvement by @tushuhei in #133
- Java style fix by @tushuhei in #134
- Fix mypy issue by @tushuhei in #135
- [Java] Inherit from sonatype oss parent by @tushuhei in #136
- Improve KNBC HTML Parser by @tushuhei in #137
Full Changelog: v0.5.0...v0.5.1
v0.5.0
Highlights
- No major change in using default parsers.
- If you're using a custom model, you need to update it. See the "Updating Models" section below.
- The defineClassAs method in javascript/src/html_processor.ts is removed.
Updating Models
As described in #112, the model file structure has been updated to improve performance and reduce file size. The change adds one level of nesting by grouping features under their prefix (for example, UW1), as the following example shows.
Before:
{"UW1:a": 123, "UW3:b": 271}
After:
{"UW1": {"a": 123}, "UW3": {"b": 271}}
You can update your custom model to the latest format by running scripts/translate_model.py.
$ python translate_model.py --format=json old-model.json > new-model.json
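For readers who want to see the restructuring itself, here is a minimal sketch of the flat-to-nested conversion shown in the Before/After example. It is not the code of scripts/translate_model.py; the function name nest_model is hypothetical, and splitting on the first colon only is an assumption made so that a feature value that is itself a colon survives.

```python
import json


def nest_model(flat_model):
    """Group "PREFIX:value" keys into {"PREFIX": {"value": weight}}.

    Hypothetical helper for illustration; not taken from the BudouX scripts.
    """
    nested = {}
    for key, weight in flat_model.items():
        # Split on the first colon only (assumption), so a key such as
        # "UW1::" maps to prefix "UW1" and value ":".
        prefix, value = key.split(":", 1)
        nested.setdefault(prefix, {})[value] = weight
    return nested


old_model = {"UW1:a": 123, "UW3:b": 271}
new_model = nest_model(old_model)
print(json.dumps(new_model))
# prints {"UW1": {"a": 123}, "UW3": {"b": 271}}
```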
What's Changed
- Nit fix on some test descriptions by @tushuhei in #109
- Delete unused tsconfig by @tushuhei in #110
- Add unit test for Web Components by @tushuhei in #111
- Update the model structure for faster processing by @tushuhei in #112
- Refactor feature_extractor by @tushuhei in #113
- Use tempfile for unit test by @tushuhei in #114
- Add model translator for ICU by @tushuhei in #115
- Add a model format updater by @tushuhei in #117
- Remove defineClassAs function by @tushuhei in #119
- Remove unnecessary assertion by @tushuhei in #118
- Remove skip nodes data from JS by @tushuhei in #120
- Update the Prepare KNBC script to break chunks by specified sequences by @tushuhei in #121
- Update JA model by @tushuhei in #122
- Version Bump to 0.5.0 by @tushuhei in #123
Full Changelog: v0.4.1...v0.5.0
v0.4.1
⚠️ Breaking Change ⚠️
We made a significant change to the model training script scripts/train.py.
- The --chunk-size option is removed because the bottleneck of memory consumption has shifted due to the overhaul.
- The script does not shuffle the input data any more. If you need shuffled data, shuffle it yourself with a tool such as shuf.
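If shuf is not available, the shuffling step can be approximated in a few lines of Python. This is only a sketch: the helper name shuffle_lines is hypothetical, and it assumes each line of the input is one independent training example, which these notes do not spell out.

```python
import random


def shuffle_lines(lines, seed=None):
    """Return a shuffled copy of `lines`.

    Hypothetical helper for illustration. Passing a fixed `seed` makes the
    order repeatable across runs; the input list is left unmodified.
    """
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Hypothetical stand-ins for lines of an encoded training data file
examples = ["example 1\n", "example 2\n", "example 3\n", "example 4\n"]
shuffled = shuffle_lines(examples, seed=0)

# Shuffling reorders the examples but never adds or drops any.
assert sorted(shuffled) == sorted(examples)
```

This version loads everything into memory, so it suits data sets that fit in RAM; for larger files, shuf or an external shuffle is the safer choice.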
What's Changed
- Faster training with sparse matrix by @tushuhei in #103
- Add lang option to JS CLI by @tushuhei in #102
- Bump json5 from 2.2.1 to 2.2.3 in /javascript by @dependabot in #104
- Cleanup the training script by @tushuhei in #105
- More accurate Japanese model by @tushuhei in #106
Full Changelog: v0.4.0...v0.4.1
v0.4.0
v0.3.0
What's Changed
Faster model training
We made model training faster by applying JAX's JIT compilation, pooling file writes, etc.
- Faster training data encoding by @tushuhei in #89
- Add out_span option for better GPU utilization by @tushuhei in #90
- Apply JAX JIT compiling for faster training by @tushuhei in #95
- Check in updated Simplified Chinese model by @tushuhei in #99
Smaller models
We made models smaller by removing less important features, disabling ASCII encoding, etc.
- Remove Unicode Block features by @tushuhei in #86
- Disable ASCII encoding when building the model file by @tushuhei in #98
- Output compact model by @tushuhei in #100
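To see why disabling ASCII escaping and emitting compact output can shrink a model file, consider the sketch below. It uses only Python's standard json module; the helper name dump_compact and the tiny sample model are made up for illustration, and the actual build script may do this differently.

```python
import json


def dump_compact(model):
    """Serialize without \\uXXXX escapes and without spaces after separators.

    Hypothetical helper for illustration; not taken from the BudouX scripts.
    """
    return json.dumps(model, ensure_ascii=False, separators=(",", ":"))


# Hypothetical weights keyed by non-ASCII characters
model = {"UW1": {"あ": 123}, "UW3": {"い": 271}}

default_text = json.dumps(model)    # escapes "あ" as \u3042
compact_text = dump_compact(model)  # keeps "あ" as-is

default_size = len(default_text.encode("utf-8"))
compact_size = len(compact_text.encode("utf-8"))

# The compact form is smaller on disk and still parses to the same data.
assert compact_size < default_size
assert json.loads(compact_text) == model
```

A Japanese character in this range takes three bytes in UTF-8 but six when written as a \uXXXX escape, so the saving grows with the number of non-ASCII feature keys.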
Misc
- encode_data: write without break line join by @tushuhei in #91
- Update unit tests for the encoding script by @tushuhei in #92
- Add more granularity in weight outputs by @tushuhei in #93
- Remove tar module dependency by @tushuhei in #96
Full Changelog: v0.2.1...v0.3.0