Releases · google/budoux

22 Oct 07:09

tushuhei

v0.6.3

27fd3bb

v0.6.3 Latest

What's Changed

Show failing sentences for quality test by @tushuhei in #453
Japanese model improvement by @tushuhei in #454
Bundle budoux-th Web Components by @tushuhei in #452
Support new CSS display syntax by @kojiishi in #483
Remove Node.js 16 support and add Node.js 22 support by @tushuhei in #484
Replace WBR with ZWSP in demo page by @tushuhei in #494
Sort ICU format output by key by @tushuhei in #495
Update JS README about Web Workers by @tushuhei in #509
[security] Include DOMPurify in the demo bundle by @tushuhei in #658
[nodejs] Override with [email protected] to suppress punycode deprecation warning by @tushuhei in #657
[demo] Bind the input content and the query param by @tushuhei in #656
Migrate to eslint flat config using @eslint/migrate-config by @tushuhei in #673
Mention Korean support in README.md by @tushuhei in #701
Correct a small typo/missing word in README.md by @adamsilverstein in #746
[Java] Handle comment nodes by @tushuhei in #764
[Java] Skip node at the end of input by @tushuhei in #765

New Contributors

@adamsilverstein made their first contribution in #746

Full Changelog: v0.6.2...v0.6.3

Contributors

tushuhei, kojiishi, and adamsilverstein

12 Jan 01:36

tushuhei

v0.6.2

4b0f8c5

v0.6.2

Thai is now supported! 🎉

What's Changed

Add the scale argument to encode_data.py by @tushuhei in #408
Nit fix for an ignored test by @tushuhei in #407
Ja model improvement by @tushuhei in #410
Add granularity option to prepare_knbc.py by @tushuhei in #417
Add Thai language support by @tushuhei in #421
Improve typing by @amitmarkel in #426
Update README for Thai support by @tushuhei in #429
Rename @returns to @return by @tushuhei in #415

New Contributors

@amitmarkel made their first contribution in #426

Full Changelog: v0.6.1...v0.6.2

Contributors

tushuhei, amitmarkel, and 2 other contributors

17 Nov 06:34

tushuhei

v0.6.1

d02254f

v0.6.1

What's Changed

Bump @typescript-eslint/eslint-plugin from 6.9.1 to 6.10.0 in /javascript by @dependabot in #353
Bump org.apache.maven.plugins:maven-surefire-plugin from 3.2.1 to 3.2.2 in /java by @dependabot in #354
Bump actions/dependency-review-action from 3.1.1 to 3.1.2 by @dependabot in #357
Bump @types/node from 20.8.3 to 20.9.0 in /javascript by @dependabot in #356
Support weighted samples by @tushuhei in #358
Fix unpaired close tags and self-closing tags by @kojiishi in #360
[Java] Stop emitting close tags if self-closing by @kojiishi in #362
Update Google Java Format action by @tushuhei in #363
Bump actions/dependency-review-action from 3.1.2 to 3.1.3 by @dependabot in #364
Bump @typescript-eslint/eslint-plugin from 6.10.0 to 6.11.0 in /javascript by @dependabot in #365
[java] Fix errors by collapsed white spaces and <br> by @kojiishi in #367
Bump github/codeql-action from 2.22.5 to 2.22.6 by @dependabot in #368
[java] Replace wholeText() with NodeVisitor by @kojiishi in #369
Implement tail for node visitor by @tushuhei in #370
Update jsoup to 1.16.2 by @tushuhei in #371
Version up to 0.6.1 by @tushuhei in #372

Full Changelog: v0.6.0...v0.6.1

Contributors

tushuhei, kojiishi, and dependabot

06 Nov 22:02

tushuhei

v0.6.0

93ac23c

v0.6.0

Noteworthy changes

BudouX Web Components don't use Shadow DOM anymore. The segmentation results will be reflected in their Light DOM, where the global styles can apply. #291
Phrases are segmented by ZWSP (U+200B) not <wbr> for a better screen reader experience. #346
You can insert non-breaking markup (<nobr> and white-space: nowrap) around a phrase that you don't want to break. #240
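
As a rough illustration of the ZWSP change above, the sketch below joins already-segmented phrases with U+200B. The phrase list is a hypothetical stand-in for a parser result, and join_with_zwsp is an illustrative helper, not part of the BudouX API.

```python
ZWSP = "\u200b"  # zero-width space, U+200B


def join_with_zwsp(phrases):
    """Join already-segmented phrases with ZWSP so a renderer may wrap between them."""
    return ZWSP.join(phrases)


# Hypothetical segmentation result; a real one would come from a BudouX parser.
phrases = ["今日は", "天気です。"]
marked = join_with_zwsp(phrases)

# The visible text is unchanged; only invisible break opportunities were added.
print(marked.replace(ZWSP, "|"))
```

Because ZWSP is a plain character rather than an element, the output stays a single text run, which is consistent with the screen reader motivation given in the note.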

What's Changed

Remove dependency to gts by @tushuhei in #187
Add Parser.parseBoundaries for JavaScript by @kojiishi in #234
Replace slice with substring by @kojiishi in #241
Support non-breaking content (<nobr> and white-space: nowrap) by @kojiishi in #240
Make scripts run without install by @tushuhei in #239
Add permissions to style check action by @tushuhei in #246
Specify maxsplit to handle colon symbols properly by @tushuhei in #247
Support non-breaking content in java by @kojiishi in #248
Support non-breaking content in Python by @kojiishi in #251
Nit: use get_nowait instead of get by @tushuhei in #253
Remove utils from JavaScript module by @tushuhei in #262
Move hasChildTextNode to HTML Processor by @tushuhei in #274
Fix mypy issues by @tushuhei in #308
Fix Python dependency issues by @tushuhei in #316
Avoid inserting separators to where the source has one by @kojiishi in #342
[Web Components] Use Light DOM instead of Shadow DOM by @tushuhei in #291
Use ZWSP instead of WBR by @tushuhei in #346
[Java] Use ArrayDeque instead of Stack by @tushuhei in #349
Rename applyElement to applyToElement by @tushuhei in #348
Update README to use ZWSP by @tushuhei in #347
Version up to 0.6.0 by @tushuhei in #343

Full Changelog: v0.5.2...v0.6.0

Contributors

tushuhei and kojiishi

03 Jul 05:46

tushuhei

v0.5.2

66f13b6

v0.5.2

What's Changed

Use overflow-wrap: anywhere; instead of overflow-wrap: break-word; by @tamanyan in #144
Add a script to finetune models. by @tushuhei in #145
Add quality regression test by @tushuhei in #146
Release finetuned model by @tushuhei in #147 #154 #161
Add validation data arg to train.py by @tushuhei in #148
Remove direct dependency to NumPy by @tushuhei in #149
Add a README for BudouX Scripts by @tushuhei in #155
Add score scale arg to build_model.py by @tushuhei in #156
Separate HTML processing as a mixin by @tushuhei in #159

New Contributors

@step-security-bot made their first contribution in #163

Full Changelog: v0.5.1...v0.5.2

Contributors

tushuhei, tamanyan, and step-security-bot

20 Apr 02:56

tushuhei

v0.5.1

50178f6

v0.5.1

What's Changed

Add Java module by @tushuhei in #124
Separate HTML processing as html_processor.py by @tushuhei in #126
Fix bug with nodes to skip by @tushuhei in #127
Rename test_utils.py to utils.py by @tushuhei in #129
Remove test utils by @tushuhei in #130
Universal unit testing by @tushuhei in #125
Replace textarea with another skip node by @tushuhei in #131
Java style fix by @tushuhei in #132
Java code improvement by @tushuhei in #133
Java style fix by @tushuhei in #134
Fix mypy issue by @tushuhei in #135
[Java] Inherit from sonatype oss parent by @tushuhei in #136
Improve KNBC HTML Parser by @tushuhei in #137

Full Changelog: v0.5.0...v0.5.1

Contributors

tushuhei

01 Mar 03:05

tushuhei

v0.5.0

638b82b

v0.5.0

Highlights

No major changes if you use the default parsers.
If you're using a custom model, you need to update it. See the "Updating Models" section below.
The defineClassAs method in javascript/src/html_processor.ts is removed.

Updating Models

As described in #112, the model file structure has been updated to improve performance and reduce file size. The change is simple: it adds one level of nesting by grouping features under their key prefix, as the following example shows.

Before:

{"UW1:a": 123, "UW3:b": 271}

After:

{"UW1": {"a": 123}, "UW3": {"b": 271}}

You can update your custom model to the latest format by running scripts/translate_model.py.

$ python translate_model.py --format=json old-model.json > new-model.json
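
For readers who want to see what the format change amounts to, here is a minimal sketch of the flat-to-nested transformation from the Before/After example. It is not the implementation of translate_model.py; nest_model is a hypothetical helper, and splitting on the first colon only is an assumption about how keys containing ":" should be treated.

```python
def nest_model(flat):
    """Group "PREFIX:feature" keys of a flat model under their prefix."""
    nested = {}
    for key, score in flat.items():
        # Assumption: only the first colon separates the prefix from the
        # feature, so a feature that itself contains ":" stays intact.
        group, feature = key.split(":", 1)
        nested.setdefault(group, {})[feature] = score
    return nested


old_model = {"UW1:a": 123, "UW3:b": 271}
new_model = nest_model(old_model)
print(new_model)
```

For real model files, the provided script remains the supported way to migrate.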

What's Changed

Nit fix on some test descriptions by @tushuhei in #109
Delete unused tsconfig by @tushuhei in #110
Add unit test for Web Components by @tushuhei in #111
Update the model structure for faster processing by @tushuhei in #112
Refactor feature_extractor by @tushuhei in #113
Use tempfile for unit test by @tushuhei in #114
Add model translator for ICU by @tushuhei in #115
Add a model format updater by @tushuhei in #117
Remove defineClassAs function by @tushuhei in #119
Remove unnecessary assertion by @tushuhei in #118
Remove skip nodes data from JS by @tushuhei in #120
Update the Prepare KNBC script to break chunks by specified sequences by @tushuhei in #121
Update JA model by @tushuhei in #122
Version Bump to 0.5.0 by @tushuhei in #123

Full Changelog: v0.4.1...v0.5.0

Contributors

tushuhei

12 Jan 23:02

tushuhei

v0.4.1

c1d8199

v0.4.1

⚠️ Breaking Change ⚠️

We made a significant change to the model training script scripts/train.py.

The --chunk-size option is removed because the overhaul shifted the memory consumption bottleneck.
The script does not shuffle the input data anymore. If needed, shuffle the data yourself with a tool such as shuf.
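
If shuf is not available, shuffling can also be done in a few lines of Python. The sketch below assumes the training data has one example per line, which is an assumption about the file layout; shuffle_lines is a hypothetical helper and not part of the BudouX scripts.

```python
import random


def shuffle_lines(lines, seed=None):
    """Return a shuffled copy of the given lines; a fixed seed makes it reproducible."""
    rng = random.Random(seed)
    shuffled = list(lines)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled


# Stand-in for the lines of an encoded training data file.
examples = ["example-1", "example-2", "example-3", "example-4"]
shuffled = shuffle_lines(examples, seed=0)
print(shuffled)
```

Passing a seed keeps training runs comparable, since the same input then always produces the same order.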

What's Changed

Faster training with sparse matrix by @tushuhei in #103
Add lang option to JS CLI by @tushuhei in #102
Bump json5 from 2.2.1 to 2.2.3 in /javascript by @dependabot in #104
Cleanup the training script by @tushuhei in #105
More accurate Japanese model by @tushuhei in #106

Full Changelog: v0.4.0...v0.4.1

Contributors

tushuhei and dependabot

14 Dec 03:52

tushuhei

v0.4.0

ad169e0

v0.4.0

What's Changed

Traditional Chinese support by @tushuhei in #101

Full Changelog: v0.3.0...v0.4.0

Contributors

tushuhei

05 Dec 01:29

tushuhei

v0.3.0

b22226f

v0.3.0

What's Changed

Faster model training

We made model training faster by applying JAX's JIT compilation, pooling file writes, etc.

Faster training data encoding by @tushuhei in #89
Add out_span option for better GPU utilization by @tushuhei in #90
Apply JAX JIT compiling for faster training by @tushuhei in #95
Check in updated Simplified Chinese model by @tushuhei in #99

Smaller models

We made models smaller by removing less important features, disabling ASCII encoding, etc.

Remove Unicode Block features by @tushuhei in #86
Disable ASCII encoding when building the model file by @tushuhei in #98
Output compact model by @tushuhei in #100
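
To give a feel for why these changes shrink a JSON model, the sketch below compares Python's default JSON output with a non-ASCII, compact one. It assumes that "ASCII encoding" corresponds to json.dumps's ensure_ascii option and that "compact" means dropping separator whitespace; the release notes do not say how the build script does it, and the model dictionary is made up.

```python
import json

# Made-up model fragment with non-ASCII feature keys.
model = {"UW1:あ": 123, "UW3:い": 271}

# Default output: non-ASCII characters are escaped as \uXXXX and
# separators carry a trailing space.
default = json.dumps(model)

# Assumed equivalent of the changes above: keep characters as-is and
# drop the whitespace after "," and ":".
compact = json.dumps(model, ensure_ascii=False, separators=(",", ":"))

print(len(default.encode("utf-8")), len(compact.encode("utf-8")))
```

Both strings decode to the same dictionary, so the saving comes purely from the serialization.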

Misc

encode_data: write without break line join by @tushuhei in #91
Update unit tests for the encoding script by @tushuhei in #92
Add more granularity in weight outputs by @tushuhei in #93
Remove tar module dependency by @tushuhei in #96

Full Changelog: v0.2.1...v0.3.0

Contributors

tushuhei
